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Abstract 

Low dimensional representations of words allow 
accurate NLP models to be trained on limited 
annotated data. While most representations ig¬ 
nore words’ local context, a natural way to in¬ 
duce context-dependent representations is to per¬ 
form inference in a probabilistic latent-variable 
sequence model. Given the recent success of 
continuous vector space word representations, 
we provide such an inference procedure for con¬ 
tinuous states, where words’ representations are 
given by the posterior mean of a linear dynam¬ 
ical system. Here, efficient inference can be 
performed using Kalman filtering. Our learn¬ 
ing algorithm is extremely scalable, operating 
on simple cooccurrence counts for both param¬ 
eter initialization using the method of moments 
and subsequent iterations of EM. In our exper¬ 
iments, we employ our inferred word embed¬ 
dings as features in standard tagging tasks, ob¬ 
taining significant accuracy improvements. Fi¬ 
nally, the Kalman filter updates can be seen as a 
linear recurrent neural network. We demonstrate 
that using the parameters of our model to ini¬ 
tialize a non-linear recurrent neural network lan¬ 
guage model reduces its training time by a day 
and yields lower perplexity. 

1. Introduction 

In many NLP applications, there is limited available la¬ 
beled training data, but tremendous quantities of unla¬ 
beled, in-domain text. An effective semi-supervised learn¬ 
ing technique is to learn word embeddings on the unlabeled 
data, which map every word to a low dimensional dense 
vector (Bengio et al., 2006; Mikolov et al., 2013; Penning- 
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ton et al., 2014), and then use these as features for super¬ 
vised training on the labeled data (Turian et al., 2010; Pas- 
sos et al., 2014; Bansal et al., 2014). Furthermore in many 
deep architectures for NLP, the first layer maps words to 
low-dimensional vectors, and these parameters are initial¬ 
ized with unsupervised embeddings (Collobert et al., 2011; 
Socher et al., 2013; Vinyals et al., 2014). 

Most of these methods embed word types, i.e., words in¬ 
dependent of local context, as opposed to work tokens, i.e., 
instances of words within their context. Ideally we would 
have a different representation per token. For example, de¬ 
pending on the context, “bank” is the side of a river or a 
financial institution. Furthemore, we would like such em¬ 
beddings to come from a probablistic sequence model that 
allows us to study the transition dynamics of text genera¬ 
tion in low dimensional space. 

We present a method for obtaining such context-dependent 
token embeddings, using a generative model with a vector¬ 
valued latent variable per token and performing posterior 
inference for each sentence. Specifically, we employ a 
Gaussian linear dynamical system (LDS), with efficient in¬ 
ference from a Kalman filter. To learn the LDS parameters, 
we use a two-stage procedure, initializing with the method 
of moments, and then performing EM with the approxi¬ 
mate second order statistics (ASOS) technique of Martens 
(2010). Overall, after taking a single pass over the train¬ 
ing corpus, the runtime of our approximate maximum- 
likelihood estimation (MLE) procedure is independent of 
the amount of training data since it operates on aggregate 
co-occurrence counts. Furthermore, performing inference 
to obtain token embeddings has the same time complexity 
as widely-used discrete first-order sequence models. 

We fit the LDS to a one-hot encoding of each token in the 
input text sequence. Therefore, the LDS is a mis-specified 
generative model, since draws from it are not proper indica¬ 
tor vectors. However, we embrace this multivariate Gaus¬ 
sian model instead of a continuous-state dynamical system 
with a multinomial link function because the Gaussian LDS 
offers several desirable scalability properties; (1) Kalman 
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filter inference is simple and efficient (2) using ASOS, the 
cost of our learning iterations does not scale with the corpus 
size, (3) we can initialize EM using a method-of-moments 
estimator that requires a single S VD of a co-occurrence ma¬ 
trix, (4) our M-step updates are simple least-squares prob¬ 
lems, solvable in closed form, (4) if we had used a multi¬ 
nomial link function, we would have performed inference 
using extended Kalman filtering, which makes a second- 
order approximation of the log-likelihood, and thus leads 
to a Gaussian LDS anyway (Ghahramani & Roweis, 1999), 
and (5) by using EM, we avoid stochastic-gradient-based 
optimization, which requires careful tuning for nonconvex 
problems. A naive application of our method scales to 
large amounts of training data, but not high-dimensional 
observations. In response, the paper contributes a variety 
of novel methods for scaling up our learning techniques to 
handle large input vocabularies. 

We employ our inferred token embeddings as features for 
part of speech (POS) and named entity recognition (NER) 
taggers. Eor POS, we obtain a 30% relative error reduction 
when applying a local classifier to our context-dependent 
embeddings rather than Word2Vec context-independent 
embeddings (Mikolov et al., 2013). When using our to¬ 
ken embeddings as additional features in lexicalized POS 
and NER taggers, which already have explicit features and 
test-time inference for context-dependence, we obtain sign- 
ficant gains over the baseline, performing as well as us¬ 
ing Word2Vec embeddings. We also present experiments 
demonstrating that the transition dynamics of the LDS cap¬ 
ture salient patterns, such as transforming first names into 
last names. 

Einally, the functional form of the Kalman filter update 
equations for our LDS are identical to the updates of a 
recurrent neural network (RNN) language model without 
non-linearities (Mikolov, 2012). A key difference between 
the LDS and an RNN, however, is that the LDS provides 
a natural backwards pass, using Kalman smoothing, where 
a token’s embedding depends on text to both the right and 
left. Drawing on the parallelism between filtering and the 
RNN, we use the LDS parameters, which can be estimated 
very quicky using our techniques, to initialize gradient- 
based optimization of a nonlinear RNN. This yields a signf- 
icant decrease in perplexity vs. the baseline RNN, and only 
requires 70% as many training epochs, saving 1 day on a 
single CPU core. 

2. Related Work 

We provide a continuous analog of popular discrete-state 
generative models used in NLP for inducing class mem¬ 
bership for tokens, including class-based language mod¬ 
els (Brown et al., 1992; Chelba & Jelinek, 2000) and induc¬ 
tion of POS tags (Christodoulopoulos et al., 2010). In par¬ 


ticular, Brown clusters (Brown et al., 1992) are commonly 
used by practioners with lots of unlabeled in-domain data. 

Our learning algorithm is very scalable because it oper¬ 
ates on aggregate count matrices, rather than individual to¬ 
kens. Similar algorithms have been proposed for obtaining 
type-level embeddings via matrix factorization (Penning¬ 
ton et al., 2014; Levy & Goldberg, 2014). However, these 
are context-independent and ignore the transition dynam¬ 
ics that link tokens’ embeddings. Lurthermore, they re¬ 
quire careful tuning of stochastic gradient methods. Previ¬ 
ous methods for token-level embeddings either use a rigid 
set of prototypes (Huang et al., 2012; Neelakantan et al., 
2014) or embed the token’s context, ignoring the token it¬ 
self (Dhillon et al.,2011). 

Lor learning discrete-state latent variable models, spectral 
learning methods also use count matrices, and thus are sim¬ 
ilarly scalable (Anandkumar et al., 2014). However, an 
LDS offers key advantages: we do not use third-order mo¬ 
ments, which are difficult to estimate, and we perform ap¬ 
proximate MLE, rather than the method of moments, which 
exhibits poor statistical efficiency. 

Recently, RNNs have been used to provide impressive re¬ 
sults in NLP applications including translation (Sutskever 
et al., 2014), language modeling (Mikolov et al., 2015), and 
parsing (Vinyals et al., 2014). We do not attempt to replace 
these with a Kalman filter, as we expect non-linearities are 
crucial for capturing long-term interactions and rigid, com¬ 
binatorial constraints in the outputs. However, RNNs train¬ 
ing can take days, even on GPUs, and requires careful tun¬ 
ing of stochastic gradient step sizes. Given the scalability 
of our parameter-free training algorithm, and our favorable 
preliminary results using the LDS to initialize a nonlin¬ 
ear RNN, we encourage further work on using linear latent 
variable models and the Gaussian approximations of multi¬ 
nomial data to develop sophisticated initialization methods. 
Already, practitioners have started using such techniques 
for initializing simple nonlinear deep neural networks us¬ 
ing the recommendations of Saxe et al. (2014). Einally, 
our work differs from Pasa & Sperduti (2014), who initial¬ 
ize an RNN using spectral techniques, in that we perform 
maximum-likelihood learning. We found this crucial for 
good performance in our NLP experiments. 

3. Background: Gaussian Linear Dynamical 
Systems 

We consider sequences of observations wi,..., Wn, where 
each Wi is a U-dimensional vector. A Gaussian LDS 
follows the following generative model (Kalman, 1960; 
Roweis & Ghahramani, 1999): 

xt =Axt-i+r] 

wt = Cxt + e, 


( 1 ) 

( 2 ) 
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where h < V is the dimensionality of the hidden states Xt 
and e ^ iV(0, D), rj ^ N{Q, Q). For simplicity, we assume 
xq is constant. 

The latent space for x is completely unobserved and we 
could choose any coordinate system for it while maintain¬ 
ing the same likelihood value. Therefore, without loss of 
generality, we can either fix A = I or Q = I, and we fix 
Q. Furthermore, note that the magnitude of the maximum 
eigenvalue of A must be no larger than 1 if the system is 
stable. We assume that the data we fit to has been centered, 
in which case the maximum eigenvalue is strictly less than 
1, since this implies Xt is asymptotically mean zero (inde¬ 
pendent of xq), so that Xt is also asymptotically mean zero. 

Finally, define the covariance at lag k to be 

'f'k ='E[wt+kwJ], ( 3 ) 

which is valid because we assume the data to be mean zero. 
Our learning algorithms require only a few (up to about 
fc = 10 in practice) as input. These matrices can be gath¬ 
ered using a single pass over the data, and their size does 
not depend on the amount of data. Furthermore, construct¬ 
ing these matrices can be accelerated by splitting the data 
into chunks, and aggregating separate matrices afterwards. 

3.1. Inference 

The Xt are distributed as a multivariate Gaussian under 
both the LDS prior and posterior (conditional on obser¬ 
vations w), so they can be fully characterized by a mean 
and variance. We use xt and St for the mean and covari¬ 
ance under the posterior for xt given wi.(^t-i), computed 
using Kalman filtering, and xt and St when considering 
the posterior for xt given all the data computed us¬ 
ing Kalman smoothing. In Appendix B.l we provide the 
full filtering and smoothing updates, which compute differ¬ 
ent means and variances for every timestep. Note that the 
updates require inverting aV x V matrix in every step. 

We employ the widely-used ’steady-state’ approximation, 
which yields substantially more efficient filtering and 
smoothing updates (Rugh, 1996). A key property of fil¬ 
tering and smoothing is that the updates to St and St 
do not depend on the actual observations, but only on 
the model’s parameters. Furthermore, they will converge 
quickly to time-independent ‘steady-state’ values. Define 
Si = to be the aymptotic limit of the co- 

variance St under the posterior for each xt given its history 
(at steady state, this is shared for all t). Flere, expectation 
is taken with respect to both time and the posterior for the 
latent variables. This satisfies 

El = ASiA^ Q, 

which can be solved for quickly using fixed point iteration. 
Similarly, we can solve for Eg = ¥\xtxj |wi:t]. Note that 


steady state, a property of the posterior, is unrelated to the 
stationary distribution of the LDS, which is unconditional 
on observations. 

Under, the steady state assumption, we can perform filter¬ 
ing and smoothing using substantially more efficient up¬ 


dates. We have; 

Xt = {A-KCA)xt-i + Kwt (4) 

xt = Jxt+i +{I - JA)xt (5) 

Here, the steady-state Kalman gain matrix is: 

a: = EiC^S'7/(6) 
where we define 

Sss = C^iC^ +D, (7) 


the unconditional prior covariance for w under the model. 
Note that {A — KCA) is data-independent and can be pre¬ 
computed, as can the smoothing matrix J = EgA^ (Ei)“^. 
For long sequences, steady-state filtering provides asymp¬ 
totically exact inference. However, for short sequences it is 
an approximation. 

3.2. Learning: Expectation-Maximization 

See Ghahramani & Hinton (1996) for a full exposition on 
learning the parameters of an LDS using EM. Under the 
steady-state assumption, the M step requires; 

E[xtx7], E[xtxJ_^_i], E[xtu;7], (8) 

where the expectation is taken with respect to time and the 
posterior for the latent variables. This can be computed us¬ 
ing Kalman smoothing and then averaging over time. The 
M step can then be done in closed form, since it is solving 
least-squares regressions for Xt+i against xt and wt against 
Xt to obtain A and C. Lastly, D can be recovered using: 

D = q/g — CE [xtwj ] 

- E [wtxj] + CE [xxj] (9) 

3.3. Learning: EM with ASOS (Martens, 2010) 

EM requires recomputing the second order statistics (8) 
in every iteration. While these can be computed using 
Kalman smoothing on the entire training set, we are in¬ 
terested in datasets with billions of timesteps. Eortunately, 
we can avoid smoothing by employing the ASOS (approx¬ 
imate second order statistics) method of Martens (2010), 
which directly performs inference about the time-averaged 
second order statistics. 

Under the steady-state assumption, this is doable because 
we can recursively define relationships between second or¬ 
der statistics at lag k and at lag k + 1 using the recursive 
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relationships of the underlying dynamical system. Namely, 
rather than performing posterior inference by recursively 
applying the linear operations (4) and (5), and then averag¬ 
ing over time, we switch the order of these operations and 
apply the linear operators to time-averaged second order 
statistics. For example, the following equality is an im¬ 
mediate consequences of the filtering equation (4) (where 
expectation is with respect to t and the posterior for x): 

E[xlwJ ] = (A - KCA)E[xlzlwJ ] + KE[wtwJ] (10) 

ASOS uses a number of such recursions, along with meth¬ 
ods for estimating covariances at a time horizon r. These 
covariances can be approximated by assuming that they are 
exactly described by the current estimate of the model pa¬ 
rameters. Therefore, unlike standard EM, performing EM 
with ASOS allows us to precompute an empirical estimate 
of the T'fc at various lags (up to about r = 10) and then 


Algorithm 1 Learning an EDS for Text 

Input: 

Text Corpus, approximation horizon r (e.g., 10) 

Output: 

EDS parameters and filtering matrices: (A, C, D, K, J) 

Gather the matrices 4'^ = ] {k < r) 

_ 1 

VE T'g ^ (diagonal whitening matrix) 

T'fc ^ (whitening) 

Params ^ Subspace_lD(T'o,..., 4'^.) 

Params ^ ASOS_EM(Params,4>o,..., 'I'r) 


x-w covariance at lag 0: G = ] and the h x (rV) 

matrix = [A^-^G A^-'^G ...AGG]. 

Next, define the Hankel matrix 


never touch the data again. Eurthermore, (Martens, 2010) 


/ 

4'r 

4'r-l 

4'r-2 • 

• 4'i 

\ 

demonstrates that the ASOS approximation is consistent. 

Hr = 


4'r+l 

4'r 

4'r-l • 

. 4'2 


Namely, the error in approximating the time-averaged sec¬ 








ond order statistics vanishes with infinite data when evalu¬ 


1 

4'2r-l 

4'2r-2 

4'r-3 • 

• 4'r 

/ 

ated at the MLE parameters. Overall, ASOS scales linearly 








with r and the cost of multiplying by the T'fc. 

Then, we have Hr = 

F^Ar. 





( 11 ) 


3.4. Learning: Subspace Identification 

We initialize EM using Subspace Identification (SSID), a 
family of method-of-moments estimators that use spectral 
decomposition to recover EDS parameters (Van Overschee 
& De Moor, 1996). The rationale for such a combination 
is that the method of moments is statistically consistent, so 
performing it on reasonably-sized datasets will yield pa¬ 
rameters in the neighborhood of the global optimum, and 
then EM will perform local hill climbing to find a local op¬ 
timum of the marginal likelihood. Eor EDS, this combina¬ 
tion yields empirical accuracy gains in Smith et al. (1999). 
A related two-stage estimator, where the local search of EM 
is replaced with a single Newton step on the local likeli¬ 
hood surface, is known to be minimax optimal, under cer¬ 
tain local asymptotic normality conditions (Le Cam, 1974). 

Eor our particular application, we use SSID as an approx¬ 
imate method, where it is not statistically consistent, due 
to the mis-specification of fitting indicator-vector data as a 
multivariate Gaussian. In our experiments, we discuss the 
superiority of SSIDh-EM rather than just SSID. We do not 
present results using EM initialized randomly rather than 
with SSID, since we found it very difficult for our high di¬ 
mensional problems to generate initial parameters that al¬ 
lowed EM to reach high likelihoods. 

We employ the ‘n4sid’ algorithm of Van Overschee & 
De Moor (1994). Define r to be a small integer. Define the 
(rV) X h matrix F^ = [G ; GA ; GA^ ; ... ; GA’-i], 
where denotes vertical concatenation. Also define the 


Let {U, S, V) result from a rank-/i SVD of an empiri¬ 
cal estimate of Hr, from which we set F^ = US^ and 
Ar = S '2 V^. To recover the EDS parameters, we first de¬ 
fine to be the submatrix of A^ corresponding to 

the first (r — 1) blocks. Similarly, define A^-'’. From the 
definition of A^, we have that AA^-’' = so we 

can estimate A as A = Ar'^"" ^^(A^''’)+. Next, one can 
read off an estimate for G as the first block of F^. Alter¬ 
natively, since the previous step gives us a value for A one 
can set up a regression problem similar to the previous step 
to solve for G by invoking the block structure in F,,. 

Finally, we need to recover the covariance matrix D. We 
first find the asymptotic latent covariance Ei using fixed- 
point iteration Ei = AEiA^ -f Q. From this, we set D 
using a similar update as (12), which uses statistics of the 
EDS posterior, except here Ei is unconditional on data and 
is purely a function of the EDS parameters. 

D = 4'o-GEiG^. (12) 

4. Linear Dynamical Systems for Text 

We fit an EDS to text using SSID to initialize EM, where 
the E step is performed using ASOS. A summary of the 
procedure is provided in Algorithm 1. SSID and ASOS 
scale to extremely large training sets, since they only re¬ 
quire the T'fc matrices, for small k. However, they can not 
directly handle the very high dimensionality of text obser¬ 
vations (vocabulary size V « 10^). In this section, we first 
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describe particular properties of the data distribution. Then, 
we describe novel techniques for leveraging these proper¬ 
ties to yield scalable learning algorithms. 

Define wt as an indicator vector that is 1 in the index of 
the word at time t and define to be the corpus frequency 
of word type i. We fit to the mean-zero observations Wt = 
Wt — p. Note that the LDS will not generate observations 
with the structure of a one-hot vector shifted by a constant 
mean, so we cannot use it directly as a generative language 
model. On the other hand, we can still fit models to training 
data with this structure, perform posterior inference given 
observations, assess the likelihood of a corpus, etc. In our 
experiments, we demonstrate the usefulness of these in a 
variety of applications. We have; 

T'o = E[wtw7] = - pp^ = diag(p) - pp^, 

(13) 

while at higher lags, 

T'fc = E[wtwJ+k] = (14) 

Approximating these covariances from a length-T corpus: 

Pi = Ex[wt] = ;^#(word i appears), (15) 
where #() denotes the count of an event. We also have 


the data to this subspace breaks the special structure de¬ 
scribed above, so we instead work in and perform pro¬ 
jections onto 1*^ implicitly as-needed. Fortunately, both 
SSID and EM find C that lies in the column space of the 
data, so iterations of our learning algorithm will maintain 
that 1 ^ col(C). In Appendix A.2, we describe how to han¬ 
dle this rank deficiency when computing the Kalman gain. 

Note that we could have used pretrained type-level embed¬ 
dings to project our corpus and then train an LDS on low¬ 
dimensional dense observations. However, this is vulnera¬ 
ble to the subspace of the type-level embeddings, which are 
not trained to maximize the likelihod of a sequence model, 
and thus might not capture proper syntactic and semantic 
information. We will release the code of our implemen¬ 
tation. SSID requires simple scripting on top of a sparse 
linear algebra library. Our EM implementation consists of 
small modifications to Martens’ public ASOS code. 

4.1. Scalable Spectral Decomposition 

SSID requires a rank-/i S VD of the very large block Hankel 
matrix Hr (11). We employ the randomized approximate 
SVD algorithm of Halko et al. (2011). To factorize a ma¬ 
trix X, this requires repeated multiplication by X and by 
X^. All the submatrices in Hr are sparse-minus-low-rank, 
so we handle the sparse and low-rank terms individually 
within the multiplication subroutines. 


= (16) 
—#(word i appears with word j k positions to the right). 

Eor real-world data, (16) will be extremely sparse, with 
the number of nonzeros substantially less than both and 
the length of the corpus. The fact that (14) is sparse-minus- 
low-rank and (13) is diagonal-minus-low-rank is critical for 
scaling up the learning algorithms. Eirst of all, we do not 
instantiate these as V x V dense matrices, but operate di¬ 
rectly on their factorized structure. Second, in Sec. 4.2 we 
show how the structure of (13) allows us to model full-rank 
V xV noise covariance matrices implicitly. Strictly speak¬ 
ing, the number of nonzeros in (16) will increase as the cor¬ 
pus size increases, due to heavy-tailed word co-occurence 
statistics. However, this growth is sublinear in T and can 
be mitigated by ignoring rare words. 

Unfortunately, each 'kfc is rank-deficient. Not only is 
E)^*] = 0, but also the sum of every wt is zero (because Wt 
is a one-hot vector and ^ is a vector of word frequencies). 
Define 1 to be the length-U vector of ones. Then, our data 
lives on 1^, the d — I dimensional subspace, orthogonal 
to 1. Doing maximum likelihood in instead of l"'^ will 
lead to a degenerate likelihood function, since the empir¬ 
ical variance in the 1 direction is 0. However, projecting 


4.2. Modeling Full-Rank Noise Covariance 

The noise covariance matrix D is V x V, which is un¬ 
manageably large for our application, and thus it is rea¬ 
sonable to employ a spherical D = dl or diagonal 
D = diag(di,..., dy) approximation. Eor our prob¬ 
lem, however, we found that these approximations per¬ 
formed poorly. Because of the property Wt = 0, off- 
diagonal elements of D are critical for modeling the anti¬ 
correlations between coordinates. This would have been 
captured if we passed wt through a logistic multinomial 
link function. However, this prevents simple inference us¬ 
ing Kalman filter. To maintain conjugacy, practitioners 
sometimes employ the quadratic upper bound to a logis¬ 
tic multinomial likelihood introduced in Bohning (1992), 
which hard-codes the coordinate-wise anticorrelations via 
D — ^ I — . However, we found this data- 

independent estimator performed poorly. 

Instead, we exploit a particular property of the SSID and 
EM estimators for D in (12) and (9). Namely, both set D to 
T'o minus a low-rank matrix, and thus D is diagonal-minus- 
low-rank, due to the structure in (13). Eor the LDS, we 
mostly seek to manipulate the precision matrix D~^. While 
instantiating this dense V x V matrix is infeasible, mul¬ 
tiplication by and evaluation of det(i9“^) can both 
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be done efficiently using the ShermanWoodbury-Morrison 
formula (Appendix B.2). In Appendix A.3, we also lever¬ 
age the formula to efficiently evaluate the training likeli¬ 
hood. These uses of the formula differ from its common 
usage for LDS, when not using the steady-state assumption 
and the posterior precision matrix needs to be updated us¬ 
ing rank one updates to the covariance. Our technique is 
particular to fitting indicator-vector data as a multivariate 
Gaussian. 

4.3. Whitening 

Before applying our learning algorithms, we first whiten 
the T' matrices with the diagonal transformation. 

W = vI/o"" =diag(/r7^...,p7). (17) 

Fitting to rather than maintains the 

data’s sparse-minus-low-rank and diagonal-minus-low- 
rank structures. Furthermore, EM is unaffected, i.e., apply¬ 
ing EM to linearly-transformed data is equivalent to learn¬ 
ing on the original data and then transforming post-hoc. 

On the other hand, the SSID output is affected by whiten¬ 
ing, since the squared reconstruction loss that SVD im¬ 
plicitly minimizes depends on the coordinate system of 
the data. We found such whitening crucial for obtaining 
high-quality initial parameters. Whitening for SSID, which 
is recommended by Van Overschee & De Moor (1996), 
solves a very similar factorization problem as canoni¬ 
cal correlation analysis between words and their contexts, 
which has been used successfully to learn word embed¬ 
dings (Dhillon et al., 2011; 2012) and identify the parame¬ 
ters of class-based language models (Stratos et al., 2014)). 

In Appendix A. 1 we also provide an algorithm, which re¬ 
lies on whitening, for manually ensuring the D returned by 
SSID is PSD, without needing to factorize a V x V matrix. 
Such manual correction is unnecessary during EM, since 
the estimator (9) is guaranteed to be PSD. 

5. Embedding Tokens using the LDS 

The only data-dependent term in the steady-state filtering 
and smoothing equations (4) and (5) is Kwt- Since wt 
can take on only V possible values, we precompute these 
word-type-level vectors. The computational cost of filter¬ 
ing/smoothing a length T sequence is 0{Th?), which is 
identical to the cost of inference on a discrete first-order 
sequence model. (6) is not directly usable to obtain K, due 
to the data’s rank-deficiency, and we provide an efficient 
alternative in Appendix A.2. This also requires the matrix 
inversion lemma to avoid instantiating S~^ in (6). 

In our experiments we use the latent space to define fea¬ 
tures for tokens. However, distances in this space are not 
well-defined, since the likelihood is invariant to any lin¬ 
ear transformation of the latent variables. To place xt in 


reasonable coordinates, we compute the empirical poste¬ 
rior covariance M = E[a:a;^] on the training data (using 
ASOS). Then, we whiten Xt using M ~2 and project the 
result onto the unit sphere. 

6. Relation to Recurrent Neural Networks 

We now highlight the similarity between the parametriza- 
tion of an RNN architecture commonly used for language 
modeling and our Kalman filter. This allows us to use our 
LDS as a novel method for initializing the parameters of 
a non-linear RNN, which we explore in Sec. 7.4. Eollow- 
ing (Mikolov, 2012) we consider the network structure; 

ht = a{Aht-i +Bwt-i) (18) 

Wt ~ SoftMax(C'/it), (19) 

Here, we employ the SoftMax transformation of a vector v 
as Vi — >■ exp(wi)/exp(ufc). The coordinate-wise non¬ 
linearity (t(-) is, for example, a sigmoid, and the network is 
initialized with some fixed vector ho- 

Consider the use of the steady-state Kalman filter (4) as an 
online predictor, where the mean prediction for wt is given 
by Cxt- Then, if we replace a and SoftMax with the iden¬ 
tity, the Kalman filter and the RNN have the same set of pa¬ 
rameters, where we B corresponds to K and A corresponds 
to {A — KCA). In terms of the state dynamics, the LDS 
may provide parameters that are reasonable for a nonlinear 
RNN, since the sigmoid a has a regime for inputs close to 
zero where it behaves like the identity. A linear approxi¬ 
mation of SoftMax() ignores mutual exclusivity. However, 
we discuss in Section 4.2 that using a full-rank D captures 
some coordinate-wise anti-correlations. Also, (19) does not 
affect the state evolution in (18). 

A key difference between the LDS and the RNN is that the 
LDS provides a backwards pass, using Kalman smoothing, 
where Xt depends on words to the right. Eor RNNs, this 
would requires separate model (Schuster & Paliwal, 1997). 

7. Experiments 

7.1. LDS Transition Dynamics 

Many popular word embedding methods learn word-to- 
vector mappings, but do not learn the dynamics of text’s 
evolution in the latent space. Using the specific LDS model 
we describe in the next section, we employ the transition 
matrix A to explore properties of these dynamics. Because 
the state evolution is linear, it can be studied easily using 
a spectral decomposition. Namely, A converts its left sin¬ 
gular vectors into (scaled) right singular vectors. Eor each 
vector, we find the words most likely to be generated from 
this state. Table 1 presents these singular vector pairs. We 
find they reflect interpretable transition dynamics. In all 
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Right Singular Vector 

Left Singular Vector 

islamist lebanese Israeli 
Palestinian british latin 
Japanese shiite greek 

territories immigrants sea 
films communities nationals 
rivals africa clients 

chris mike Steve 

Jason tim Jeff 
bobby ian greg 

evans anderson harris 
robinson smith phillips 
Collins murray murphy 

singh berlusconi sharon 
blair putin abbas 
netanyahu brown levy 

shares referee suggesting 
industries testified insisted 
adding arguing yesterday 

tampa Colorado minnesota 
detroit Cleveland phoenix 
Indiana Seattle dallas 

bay derby division 
county sox district 
sole river valley ballet 

policemen helicopters soldiers 
suspects demonstrators guards 
Iraqis personnel detainees 

remained expressed outst 
recommended remains feels 
gets resumed sparked 

salt chicken pepper 
chocolate butter cheese 
cream sauce bread 

chicken cream pepper 
sauce cheese chocolate 
salt butter bread 


Table 1. Words likely to be generated for singular vector pairs of 
the LDS transition operator. The operator maps right vectors to 
left, and the pairs are syntactically and semantically coherent. 


but the last block, the vectors reflect strict state transitions. 
However, in the last block contains topical terms about food 
invariant under A. Overall, we did not And such salient 
structure in the parameters estimated using SSID. 

7.2. POS Tagging 

Unsupervised learning of generative discrete state models 
for text has been shown to capture part-of-speech (POS) 
information (Christodoulopoulos et al., 2010). In response, 
we assess the ability of the LDS to also capture POS strac- 
mre. Token embeddings can be used to predict POS in two 
ways: (1) by applying a local classifier to each token’s em¬ 
bedding, or (2) by including each token’s embedding as ad¬ 
ditional features in a lexicalized tagger. For both, we train 
the tagging model on the Penn Treebank (PTB) train set, 
which is not included for LDS training. Token embeddings 
are obtained from Kalman smoothing. We evaluate tagging 
accuracy on the PTB test set using the 12 ‘universal’ POS 
tags (Petrov et al., 2011) and the original tags. We contrast 
the LDS with type embeddings from Word2Vec, trained on 
the LDS data (Mikolov et al., 2013). 

We At our LDS using a combination of the APNews, New 
York Times, and RCVl newswire corpora, about IB tokens 
total. We maintain punctuation and casing of the text, but 
replace all digits with “NUM’ and all but the most 200k 
frequent types with “OOV.” We employ r = 4 for SSID, 
r = 7 for EM, and h = 200. We add 1000 psuedocounts 
for each type, by adding to each coordinate of /i. 

The LDS hyperparameters were selected by maximizing 
the accuracy of a local classifier on the PTB dev set. 
This also included when to terminate EM. Eor Word2Vec, 
we performed a broad search over hyperparameters, again 


tags 

W2V 

SSID 

EM 

Lex 

Lex 

-l-EM 

Lex 

-I-W2V 

U 

95.00 

89.26 

96.44 

97.97 

98.05 

98.02 

P 

92.58 

83.00 

94.30 

97.28 

97.32 

97.35 


Table 2. POS tagging with universal tags (U) and PTB tags (P). 
maximizing for local POS tagging. Our local classifier was 
a two-layer neural network with 25 hidden units, which 
outperformed a linear classifier. The best Word2Vec con¬ 
figuration used the CBOW architecture with a window 
width of 3. The lexicalized tagger’s hyperparameters were 
also tuned on the PTB dev set. Eor the local tagging, we 
ignored punctuation and few common words types such as 
“and” in training. Instead, we classified them directly using 
their majority tag in the training data. 

Overall, we found that the LDS and Word2Vec took about 
12 hours to train on a single-core CPU. Since the Word2Vec 
algorithm is simple and the code is heavily optimized, it 
performs well, but our learning algorithm would have been 
substantially faster given a larger training set, since the 
matrices can be gathered in parallel and the cost of SSID 
and ASOS is sublinear in the corpus size. In Section 7.4, 
training the LDS is order of magnitude faster than an RNN. 

Our results are shown in Table 2. Left to right, we com¬ 
pare Word2Vec (W2V), SSID, EM initialized with SSID 
(EM), our baseline lexicalized tagger (LEX), the lexical¬ 
ized tagger with extra features from LDS token embed¬ 
dings (LEX + EM), and the lexicalized tagger with type- 
level Word2Vec embeddings (LEX + SSID). 

The first 3 columns perform local classification. Eirst, 
while SSID is crucial for EM initialization, we found it 
performed poorly on its own. However, EM outperforms 
Word2Vec substantially. We expect this is because the LDS 
explicitly maximizes the likelihood of text sequences, and 
thus it forces token embeddings to capture the transition 
dynamics of syntax. All differences are statistically signif¬ 
icant at a .05 significance level using the exact binomial 
test. In Appendix C, we demonstrate the importance of 
SSID vs. random initialization. The final 3 columns use a 
carefully-engineered tagger. For universal tags, LDS and 
Word2Vec both contribute a statistically-signiflcant gain 
over the baseline (Lex), but their difference is not signif¬ 
icant. For PTB tags, we And that Word2Vec achieves a sig- 
niflcant gain over LEX, but the LDS does not. We expect 
that our context-dependent embeddings perform as well 
as context-independent embeddings since the taggers’ fea¬ 
tures and test-time inference capture non-local interactions. 

7.3. Named Entity Recognition 

In Table 3 we consider the effect of unsupervised token fea¬ 
tures for NER on the Conll 2003 dataset using a lexicalized 
tagger (Lex). We use the same LDS and Word2Vec models 
as in the previous section, and also compare to the Brown 
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clusters used for NER in Ratinov & Roth (2009). As be¬ 
fore, we find that Word2Vec and LDS provide significant 
accuracy improvements over the baseline. We expect that 
the reason the LDS does not outperform Word2Vec is that 
NER relies mainly on performing local pattern matching, 
rather than capturing long-range discourse structure. 


set 

Lex 

Lex-l-Brown 

Lex-HW2V 

Lex-l-LDS 

dev 

93.90 

93.79 

94.14 

94.21 

test 

89.34 

89.76 

90.00 

89.9 


Table 3. NER with various unsupervised token features 


7.4. RNN initialization 

As highlighted in Section 2, RNNs can provide impres¬ 
sive accuracy in various applications. We consider the 
simple RNN architecture of Sec. 6, since it permits natu¬ 
ral initialization with an LDS and because Mikolov et al. 
(2015) demonstrate that small variants of it can outperform 
LSTMs as a language model (EM). Note that the ‘context 
units’ of Mikolov et al. (2015) could also be learned us¬ 
ing our EM procedure, by restricting the parametrization 
of A. We leave exploration of hierarchical softmax obser¬ 
vations (Mnih & Hinton, 2009), and other alternative archi¬ 
tectures, for future work. 

We evaluate the usefulness of the LDS for initializing the 
RNN under two criteria: (1) whether it improves the per¬ 
plexity of the final model, and (2) whether it leads to faster 
optimization. A standard dataset for comparing language 
models is the Penn Treebank (PTB) (Sundermeyer et al., 
2012; Pachitariu & Sahani, 2013; Mikolov et al., 2015). We 
first train a baseline, obtaining the same test set perplex¬ 
ity as Mikolov (2012), with 300 hidden dimensions. This 
initializes parameters randomly, with lengthscales tuned as 
in Mikolov (2012). Next, we use the LDS to initialize an 
RNN. In order to maintain a fair comparison vs. the base¬ 
line, we train the LDS on the same PTB data, though in 
practice one should train it on a substantially larger corpus. 

We use the popular RNN learning rate schedule where it 
is constant until performance on held-out data fails to im- 



Figure 1. RNN Dev set perplexity vs. # passes over the training 
data for baseline initialization vs. LDS initialization. 



Baseline 

LDS 

dev 

130.6 

128.1 

test 

124.0 

122.8 


Table 4. Final perplexity for an RNN language model trained us¬ 
ing random parameter initialization vs. LDS initialization, 
prove, and then it is decreased geometrically until the held- 
out performance again fails to improve (Mikolov, 2012). 
We tuned the initial value and decay rate. When initializing 
with the LDS, small learning rates are crucial: otherwise, 
the optimization jumps far from where it started. 

In Ligure 1, we plot perplexity on the dev set vs. the num¬ 
ber of training epochs. The time to train the LDS, about 30 
minutes, is inconsequential compared to training the RNN 
(4 days) on a single CPU core. LDS training on the PTB 
is faster than our experiments above with IB tokens be¬ 
cause we use a small vocabulary and run far fewer EM it¬ 
erations, in order to prevent overfitting. The RNN base¬ 
line converged after 17 training epochs, while using the 
LDS for initialization allowed it to converge after 12, which 
amounts to about a day of savings on a single CPU core. 
Next, in Table 4 we compare the final perplexities on the 
dev and test sets. We find that initializing with the LDS 
also provides a better model. 

We found that initializing the RNN with LDS parameters 
trained using SSID, rather than SSIDh-EM, performed no 
better than the baseline. Specifically, the best performance 
was obtained using a high initial learning rate, which al¬ 
lows gradient descent to ignore the SSID values. We ex¬ 
pect this is because the method of moments requires lots of 
data, and the PTB is small. In a setting where one trains 
the LDS on a very large corpus, it is possible that SSID is 
effective. Overall, we did not explore initializing the RNN 
using type-level embeddings such as Word2Vec, since it is 
unclear how to initialize A and how to set K C. 

8. Conclusion and Future Work 

We have contributed a scalable method for assigning 
word tokens context-specific low-dimensional representa¬ 
tion that capture useful syntactic and semantic structure. 
Our algorithm requires a single pass over the training data 
and no painful tuning of learning rates. 

Next, we will extend ASOS to new models and improve 
initialization for alternative RNN architectures, including 
hierarchical softmaxes, by leveraging not just the LDS pa¬ 
rameters, but also the LDS posterior on the training data. 
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Supplementary Material 

A. Scaling UP LDS Learning to Text 

As discussed in Section 4.3, we whiten our data using 

W = (20) 

Besides improving the empirical performance of SSID, working in the whitened coordinate system also simplifies various 
details used in Section 4 when scaling up LDS learning for text. Under this transformation, we have tko = diag(/r) — 

This simplifies various steps because our estimators (12) and (9) are of the form I — [low rank matrix], rather than 
diag(p) — [low rank matrix]. In the whitened coordinates, the data are orthogonal to ps, rather than 1. 

A.l. Recovering PSD D in SSID 

While SSID is consistent, for finite data the procedure is not guaranteed to yield a positive semidefinite (PSD) estimate for 

D, which is required because it is a covariance matrix. In our particular case, the D we seek will be singular on the span 
1 1 -L 

of p 2 , but Subspace ID will still not guarantee that D will be PSD on /r s . 

This is critical because if D is not PSD on this subspace, then we can not define a valid Kalman filtering procedure for the 
model (see Sec. A.2). However, due to the structure of our data distribution, D can easily be fixed post-hoc. 

From (12) we have the estimator 




( 21 ) 


Next, define Da = I — ~ (1 ~ a)CTiiC^ and define the PSD estimator D' = Dag, where ao is the minimal value 

i -L 

such that Da is PSD on p .2 . We next show how to find ag. 

We have that Da is PSD on /xs iff the maximum eigenvalue of (1 — a)CTiiC^ is less than 1. This is because is a 

unit vector and we can ignore any cross terms between ps and (1 — q;)C'SiC'^ because col(C) = /xs , which is true 
because the data lies in this subspace. Therefore we can find ag using the following procedure: 

1. Find the maximal eigenvalue of CEiC^, using power iteration. This can be done efficiently by keep CSiC^ in 
its factorized form and not instantiating aV x V matrix. 

2. If So < set ao = 0- Otherwise, set ao = . 

A.2. Efficiently Computing the Kalman Gain Matrix 

Next, recall our expression (6) for the steady state Kalman gain K = 

KSss = ^iC^, 

where 

= CEiC^ + D 


which comes from solving the system 

( 22 ) 

(23) 


Furthermore, note that both of our estimators for D, (12) and (9), maintain the property that /X 2 is an eigenvector of 
eigenvalue 0 for D. 

Since /x5 is also orthogonal to co^C), we have that ps ^ Col(5'ss)- Therefore, we cannot use (6) directly because Sgs 
is not invertible along this direction. However, we can still solve (22) as K = EiC^S'+. This pseudoinverse can be 
characterized as: 


Sts = [inversion of Sgs within col(S'ss)] [projection onto col(S'ss)] 


(24) 
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Furthermore, note that both estimators for D have the form that 

D = — (PSD, low rank, and -L /r®) (25) 

= / —— (PSD, low rank and -L/r®) (26) 

1 1T 

:=/ ——L (27) 

Therefore, it remains to define the pseudoinverse of 

5'^^ = /-^5^5^+C'(Ei-M)C'^). (28) 

Furthermore, since col(L) = col(C') = , we can define L = CMC^ for some positive definite M, so we consider 

= /-/r5pr+C(Ei-M)C^). (29) 

Observe that 

{I+ C{T.i-M)C^)-^ (30) 

is a valid inverse for Sss on /r? . This follows from the orthogonality of and col(C'), so we can effectively ignore the 
1 1 J- 

/i 2 term in (29) when inverting it on p. 2 . 

Therefore, we employ 

{Sss)+ = (/ + C(Si-M)CT)-i(/-p2^i), (31) 

1 -L 

where the right term is an orthogonal projection onto ps . 

The term in the inverse (31) is diagonal-plus-low-rank and can be manipulated efficiently using the matrix inversion lemma 
formula (53): 

(/ -f ^(Ei - M)C^)-^ = 1- ^((Ei - M)-^ + Ccy'^C^. (32) 

Therefore we can obtain K without instantiating an intermediate matrix of size V x V. 

Recall the filtering equation (4): 

xl = (A — KCA)xlZi + Kwt- 

We seek to avoid any 0(V) (or worse) computation at test time when filtering. First of all, we can precompute (A — KCA). 
For the second term, there are only V possible values for the unwhitened input wt = Wt—IJ-, so we would like to precompute 


KW{wt — p) for every possible value that the indicator wt can take on. Let wt = Ci, we have: 

^(tht-p) = EiCT5ilL(ei-p) (33) 

= EiCT(/ + C(Ei -M)CT)-i(/-p2pi^)(VFe,-p5) (34) 

= EiC'T(/ -f C'(Ei - M)C^)-^{I - (35) 

= EiC'T(/-f ^(Ei(36) 
= [EiC^(/ + C(Ei-M)C'^)-ifL]^, (37) 

(38) 


In the final line, the subscript i denotes the zth column of a matrix. 
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A.3. Likelihood Computation 

Sss is also used when computing the log-likelihood of input data (wi,..., wt)- 

1 ^ 

LL = -TC log(27r) - - logdet(5'ss) + - WtY- Wt). (39) 

Here, = CAxt, where xt is the posterior mean for xt given observations Sss is only invertible along /i5 , 

but — Wt) varies only on this subspace, so we can effectively ignore the zero-variance direction . Therefore, we 


just use (30) as Sss^ in (39). 

For the data-dependent term in our likelihood, we have; 

- I - ^t) (40) 

= -Y^r ('S'7s^Et[(u;P‘^ - u;t)(wP‘^ “ (41) 

= -^tr - CAxt)iwt - CAxt)^]) (42) 

= ^ {tr (^-'EtKzu^]) - 2tr {S-s^Et[wtxJ]A^C^) + tr {S-^CAEt[xtxJ]A^C^)) (43) 

= ^ {tr {S7s^l) - 2tr + tr {S7s^CAEt[xtxJ]A^C^)) (44) 


Note that the Et\xtxJ ] term above is different from Si, since the former is from the posterior distribution given the input 
data and Si is from the prior. 

The first term can be computed using (57). The latter two terms are of the form tr where Z and W are both 

V X k, so we can invoke (58). For the logdet(S'ss) term, we consider Sss only on , so we compute — log det(S'“^), 
where 5“^ comes from (30) and we employ the formula (55). 

B. Background 

B.l. Non-Steady-State Kalman Filtering and Smoothing 

We will use xj and SJ for the mean and variance under the posterior for Xt given Wi-r- We will use Xt and Sf when 
considering the posterior for xt given all the data wi-,t- The following are the forward ‘filtering’ steps (Kalman, 1960; 
Ghahramani & Hinton, 1996): 


xl ^ = Axl_l (45) 

= ASlzlA^ + Q (46) 

Kt = Si-^ C (C'S'‘Zi^ +D)-^ (47) 

x\= xl_^+ Kt{wt- Cxl-^) (48) 

S'*-! = S‘"^ - KtCSl-^ (49) 

Next, we have the backwards ‘smoothing’ steps: 

Jt-i = S‘ziA'(S‘-i)-! (50) 

xt-i = x\z\ +- Ax\z\) (51) 

C1 = Slzl + Jt -1 (57 - S*-1) J7_ 1 (52) 


Note that the updates for the variances S are data-independent and just depend on the parameters of the model. They will 
converge quickly to time-independent ‘steady state’ quantities. 



A Linear Dynamical System Model for Text 


B.2. Matrix Inversion Lemma 

Following Press et al. (1987), we have 


{A + USV^)-^ = A-^ - A-^U(S-^ + A-^U)-^V^A-^ (53) 

and the related expression for determinants: 

det(y4 + USV^) = det{S) det(y4) det)^-^ + V^A-^U). (54) 

i.e. 

logdet(A + USV^) = logdet(5') + logdet(A) + logdet(S'“^ + A~^U). (55) 

Expression (53) is useful if we already have an inverse for A and want to efficiently compute the inverse of a low-rank per¬ 
turbation of A. It is also useful in order to be able to do linear algebra using {A + U without actually instantiating 
aV xV matrix, which can be unmanageable in terms of both time and space for large V. For example, let M he an V xm 
matrix with m « V, then we can compute M{A + USV^)~^ using (53) by carefully placing our parentheses such that 
no V X V matrix is required. In our application, A is diagonal, so computing its inverse is trivial. Also, note that (53) can 
be used recursively, if A is defined as another sum of an easily invertible matrix and a low rank matrix. 

Along these lines, here are a few additional useful identities that follow from (53) for quantities that can be computed 
without time or storage. Here, we assume that both A~^ and tr{A~^) can be computed inexpensively (e.g., A is 
diagonal). 

For any product XY^, where X and Y are V x k matrices, note that we can compute tr{XY"'") in 0{Vk) time as 

tr{XY^) = EE X,,Y,j. (56) 

i 3 

We can use this to compute the trace of the inverse of a matrix implicitly defined via the matrix inversion lemma: 


tr [(A -f USV^)-^] = fr(A-i) - tr 


A-^U{S-^ + A-^U)-^ 

X - 


(57) 


More generally. Let Z and W heV x k matrices, then we compute 


tr [(A -f USV^)-^ZW^] = tr{^^^) - tr 

X yT 


A-^U{S-^ 


+ V^A-^U)-^ 

X 


V^A-^ZW^ 

-V-^ 


We use (58) when computing the Likelihood in Section A.3. 


(58) 


C. SSID Initialization vs. Random Initialization 

In Figure 2, we contrast the progress of EM, in terms of the log-likelihood of the training data, when initializing with SSID 
vs. initializing randomly (Random). Note that the initial values of SSID and Random are nearly identical. This is due to 
model mispecification, and the fact that we chose the lengthscales of the random parameters post-hoc, by looking at the 
lengthscales of the SSID parameters. Over the course of 100 EM iterations, the model initialized with SSID climbs quickly 
and begins leveling out, whereas it takes a long time for the Random model to begin climbing at all. We truncate at 100 
EM iterations, since we actually use the S SID-initialized model after the 50th iteration. After that, we find that local POS 
tagging accuracy diminished. 
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Iterations 


Figure 2. EM Log-Likelihood vs. training iterations for random initialization and SSID initialization. 








